
feat(eval): Add skill evaluation framework#5

Merged
saadshahd merged 2 commits into main from feat/skill-evaluations
Dec 16, 2025

Conversation

@saadshahd
Owner

Summary

  • CI workflow runs skill auto-triggering tests on PRs touching plugin code
  • Uses claude-code-action with --plugin-dir to load plugins from checked-out repo
  • Local testing via ./eval/run.sh --simple
  • 5 test cases covering hope:gate, hope:soul, hope:trace, product, wordsmith

Test plan

  • CI workflow triggers and completes
  • At least one test returns VERDICT: PASS
  • Local ./eval/run.sh --simple hope-gate-completion works

- CI workflow runs on PRs touching plugin code
- Uses claude-code-action with --plugin-dir to load plugins
- Local testing via ./eval/run.sh --simple
- 5 test cases for skill auto-triggering
- JSON schema for structured output (future API key support)
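For illustration, a test case might look roughly like this. Only prompt and expected_behaviors are confirmed to be read by the workflow; the name and the prompt text here are hypothetical:

```yaml
# Hypothetical shape of a case under eval/cases/skill-triggers/
name: hope-gate-completion
prompt: "I think this task is done, can you double-check before I ship it?"
expected_behaviors:
  - "hope:gate skill activates"
  - "Response verifies completion criteria before approving"
```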
@greptile-apps

greptile-apps bot commented Dec 16, 2025

Greptile Overview

Greptile Summary

Adds comprehensive skill evaluation framework using claude-code-action for automated testing of skill auto-triggering.

  • CI workflow runs 5 test cases in parallel matrix strategy on PRs touching plugin code
  • Local testing via ./eval/run.sh --simple with OAuth support
  • Well-structured test cases for hope:gate, hope:soul, hope:trace, product, and wordsmith skills
  • Claude self-evaluates against expected behaviors, enforcing structured output via JSON schema
  • Clear documentation for both developers (docs/dev/evaluations.md) and users (eval/README.md)
  • CHANGELOG.md properly updated following project conventions

Confidence Score: 4/5

  • This PR is safe to merge with low risk - adds testing infrastructure without touching production code
  • Score reflects well-designed evaluation framework with good documentation and proper error handling. Minor consideration around PARTIAL verdict handling in CI (currently passes but could be stricter). All test cases follow consistent structure, shell script is clean, and changes align with CLAUDE.md conventions including proper CHANGELOG entry.
  • No files require special attention - all files are well-structured with appropriate documentation

Important Files Changed

File Analysis

Filename Score Overview
.github/workflows/eval.yml 4/5 Adds CI workflow for skill auto-triggering tests using claude-code-action with matrix strategy for parallel test execution
eval/run.sh 5/5 Local test runner with --simple flag for OAuth compatibility and help documentation
eval/schema.json 5/5 JSON schema for structured evaluation output with verdict, behaviors, and evidence tracking
docs/dev/evaluations.md 5/5 Developer documentation explaining evaluation philosophy, architecture, and rationale for tool choices

Sequence Diagram

sequenceDiagram
    participant PR as Pull Request
    participant GH as GitHub Actions
    participant CCA as claude-code-action
    participant Claude as Claude Code
    participant Plugins as Plugin Skills
    participant YAML as Test Case YAML

    PR->>GH: Trigger on plugin/** or eval/** changes
    GH->>GH: Matrix strategy: spawn 5 parallel jobs
    
    loop For each test case
        GH->>CCA: Run with OAuth token
        CCA->>Claude: Initialize with --plugin-dir flags
        Claude->>Plugins: Load hope, product, wordsmith, founder, career
        
        CCA->>Claude: Send eval prompt
        Claude->>YAML: Read test case file
        YAML-->>Claude: Extract prompt + expected_behaviors
        
        Claude->>Claude: Process prompt (skills auto-trigger)
        Claude->>Plugins: Skills evaluate trigger conditions
        Plugins-->>Claude: Skill activates (or doesn't)
        
        Claude->>Claude: Self-evaluate response vs expected_behaviors
        Claude->>Claude: Generate verdict (PASS/PARTIAL/FAIL)
        Claude-->>CCA: Return result with verdict
        
        CCA-->>GH: Output result
        GH->>GH: Parse verdict with grep
        
        alt VERDICT: PASS
            GH->>GH: Print success ✓
        else VERDICT: PARTIAL
            GH->>GH: Print warning ⚠
        else VERDICT: FAIL or no verdict
            GH->>GH: Fail job with error
        end
    end
    
    GH->>PR: Report CI status


@greptile-apps greptile-apps bot left a comment


12 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +72 to +73
elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
echo "⚠ $TEST_NAME: PARTIAL"

style: VERDICT: PARTIAL exits with success but may indicate issues. Consider if partial verdicts should fail CI to maintain quality gates.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


@claude

claude bot commented Dec 16, 2025

PR Review: Skill Evaluation Framework

Thank you for this well-structured addition! The evaluation framework is a solid approach to ensuring skill quality.

Strengths

Architecture & Design

  • Self-evaluation approach is elegant and aligns well with Claude Code philosophy
  • Matrix strategy in CI allows parallel test execution - good for performance
  • fail-fast: false ensures all tests run even if one fails - excellent for visibility
  • Clear separation between simple mode (OAuth) and full mode (API key) shows good UX thinking
  • Test case schema (YAML) is well-structured and extensible

Documentation

  • Excellent docs in eval/README.md and docs/dev/evaluations.md
  • Clear rationale for why NOT using promptfoo/SDK (docs/dev/evaluations.md:96-106)
  • CHANGELOG.md properly updated
  • CLAUDE.md updated with evaluation instructions

Code Quality

  • Shell script uses set -euo pipefail - excellent error handling
  • Proper use of SCRIPT_DIR for path resolution
  • Help function is clear and useful
  • Good use of 2>/dev/null to suppress errors

Issues & Recommendations

1. CRITICAL: Test Case Date Inconsistency

Location: All test YAML files (eval/cases/skill-triggers/*.yaml:26)

All test files show created: 2024-12-15 but the current date is 2025-12-16. This appears to be a typo.

Fix: Update to 2025-12-15 in all 5 test files.

2. CI Workflow: Missing Plugin Directory Check

Location: .github/workflows/eval.yml:56

The workflow references --plugin-dir ./founder --plugin-dir ./career but there is no validation that plugin directories are valid.

Recommendation: Add a validation step before running tests.
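A minimal validation step might look like this (the check is a sketch; plugin directory names would come from the workflow's --plugin-dir flags):

```shell
# Sketch: verify each plugin directory exists before the test matrix runs
validate_plugins() {
  local missing=0 dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "Error: plugin directory '$dir' not found" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In the workflow this could run as a step before claude-code-action, e.g. `validate_plugins hope product wordsmith founder career`.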

3. Shell Script: Partial Test Name Matching

Location: eval/run.sh:49

Documentation shows ./eval/run.sh --simple hope-gate but the actual test is hope-gate-completion.yaml. This will fail.

Fix: Either update docs to use full names, OR add fuzzy matching for partial test names.
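If fuzzy matching is the chosen fix, a sketch of the lookup could be (directory layout and exact semantics are illustrative):

```shell
# Sketch: resolve a possibly-partial test name to a case file
resolve_case() {
  local name="$1" dir="$2" match
  if [ -f "$dir/$name.yaml" ]; then          # exact match wins
    echo "$dir/$name.yaml"
    return 0
  fi
  # Otherwise take the first file whose name contains the query
  match=$(find "$dir" -maxdepth 1 -name "*$name*.yaml" 2>/dev/null | head -n 1)
  if [ -n "$match" ]; then
    echo "$match"
    return 0
  fi
  return 1
}
```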

4. Schema: reasoning Should Be Required

Location: eval/schema.json:45

The reasoning field is optional, but the CI prompt says explain your reasoning briefly. This could cause validation failures.

Fix: Add reasoning to required fields array.
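The fix amounts to listing reasoning in the required array. A trimmed sketch (property shapes other than verdict and reasoning are assumptions, not the actual eval/schema.json contents):

```json
{
  "type": "object",
  "properties": {
    "verdict": { "enum": ["PASS", "PARTIAL", "FAIL"] },
    "reasoning": { "type": "string" }
  },
  "required": ["verdict", "reasoning"]
}
```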

5. Test Coverage: Missing Negative Cases

Location: eval/cases/skill-triggers/

All 5 test cases validate positive triggers. No tests for skills that should NOT trigger on certain prompts.

Recommendation: Add negative test cases to ensure skills do not over-trigger.
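A negative case could mirror the positive ones, with the expectation inverted. A hypothetical sketch (field names follow the existing cases; the prompt is invented):

```yaml
# Hypothetical negative case: the skill should NOT trigger
name: hope-gate-no-trigger
prompt: "What's the weather like today?"
expected_behaviors:
  - "hope:gate skill does not activate"
  - "Response answers the question directly, with no gating checklist"
```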

6. Test Coverage: Missing founder/career Tests

Location: .github/workflows/eval.yml:56

CI workflow loads founder and career plugins but no test cases exist for them.

Fix: Either add test cases or remove from --plugin-dir flags.

7. Security: Potential Log Exposure

Location: .github/workflows/eval.yml:61-66

The full RESULT output is echoed to CI logs. If Claude response contains sensitive repo info, it will be exposed.

Mitigation: Sanitize output or only log verdict line.
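One way to log only the verdict line, sketched as a helper (the MISSING fallback is illustrative):

```shell
# Sketch: surface only the first verdict line, never the full model output
log_verdict_only() {
  echo "$1" | grep -m 1 "VERDICT:" || echo "VERDICT: MISSING"
}
```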

8. Performance: Timeout May Be Too Short

Location: .github/workflows/eval.yml:57

timeout_minutes: 5 might be tight for large repos or slow runners.

Recommendation: Increase to 10 minutes.

9. Documentation: Missing Failure Triage Guide

Location: docs/dev/evaluations.md:82-93

Troubleshooting section does not explain how to debug failed CI tests locally.

Recommendation: Add Debugging Failed Tests section with step-by-step instructions.

Summary

Code Quality: 4/5
Test Coverage: 3/5 (Missing negative cases and 2 plugins)
Documentation: 5/5 (Excellent!)
Security: 4/5 (Minor log exposure risk)
Performance: 4/5 (Timeout could be longer)

Recommended Actions

Must Fix:

  1. Update test case dates from 2024 to 2025
  2. Fix shell script example or add fuzzy matching
  3. Make reasoning required in schema.json

Should Fix:
4. Add plugin validation step to CI
5. Add test cases for founder/career or remove from workflow
6. Increase timeout to 10 minutes
7. Add negative test cases

Nice to Have:
8. Add failure triage guide to docs
9. Collect test artifacts in CI
10. Sanitize CI log output

Overall Assessment

This is a high-quality PR that introduces a well-thought-out evaluation system. The self-evaluation approach is clever and avoids external dependencies. The documentation is exemplary.

With the fixes above (especially the date correction and schema update), this is ready to merge.

Verdict: APPROVE with minor changes requested

Follow-up commit notes:

- Remove `timeout_minutes` input (not valid for claude-code-action)
- Add `show_full_output: true` for better CI debugging
- Add session_id to debug output
- Improve error messages for execution_file issues
@saadshahd saadshahd merged commit b2b44d0 into main Dec 16, 2025
1 of 6 checks passed
@claude

claude bot commented Dec 16, 2025

PR Review: Skill Evaluation Framework

Summary

This PR adds a comprehensive skill evaluation framework to test auto-triggering across the moo.md plugin ecosystem. The implementation is well-architected and follows the project's conventions.


Strengths

Architecture & Design

  • Smart approach: Using Claude to evaluate Claude is elegant and aligns with Claude Code's philosophy
  • Flexible testing: Both CI (via claude-code-action) and local testing (via ./eval/run.sh) are supported
  • Parallel execution: Matrix strategy in CI allows concurrent test runs (.github/workflows/eval.yml:21-29)
  • Proper fail-fast handling: fail-fast: false ensures all tests run even if one fails (.github/workflows/eval.yml:22)

Code Quality

  • Clean shell script: Proper error handling with set -euo pipefail (eval/run.sh:3)
  • Good UX: Help text and clear usage examples (eval/run.sh:39-52)
  • Documentation: Three levels of docs (README, dev guide, inline comments) provide good coverage

Conventions Adherence

  • CHANGELOG.md updated: Correctly added under [Unreleased] with proper category (CHANGELOG.md:12-15)
  • CLAUDE.md updated: Added eval instructions in appropriate section (CLAUDE.md:41-46)
  • Consistent naming: Test cases use kebab-case as per project conventions

Issues & Recommendations

1. Security: Secrets Exposure Risk

Severity: HIGH

The workflow uses show_full_output: true (.github/workflows/eval.yml:57), which logs the entire Claude output. If Claude accidentally includes sensitive data (API keys, tokens, file contents with secrets), it will appear in CI logs.

Recommendation:

# Remove or set to false
show_full_output: false  # Only show summary

If full output is needed for debugging, add a comment explaining why and ensure skills never process sensitive data.


2. Error Handling: Missing Claude CLI Check

Severity: MEDIUM

The local test script (eval/run.sh) assumes claude CLI is installed but never checks. If missing, users get cryptic 'command not found' errors.

Recommendation (eval/run.sh:5-8):

# Add after SCRIPT_DIR definition
if ! command -v claude &> /dev/null; then
  echo 'Error: claude CLI not found. Install Claude Code first.' >&2
  exit 1
fi

3. Bug: Test Case Schema Mismatch

Severity: MEDIUM

Test YAML files include fields that are not used anywhere:

  • model_tier, timeout_ms (hope-gate-completion.yaml:11-12)
  • evaluators, tags, created (hope-gate-completion.yaml:14-26)

The CI workflow only reads prompt and expected_behaviors (.github/workflows/eval.yml:44-47). This creates maintenance confusion.

Options:

  1. Remove unused fields from YAML files (cleaner)
  2. Document in docs/dev/evaluations.md which fields are reserved for future use
  3. Use the fields (e.g., pass timeout_ms to claude-code-action if supported)

4. Robustness: Partial CI Failures Not Handled

Severity: MEDIUM

The CI exits 0 on VERDICT: PARTIAL but this masks regressions where skills trigger but miss behaviors.

Current (.github/workflows/eval.yml:94-95):

elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "$TEST_NAME: PARTIAL"  # Exit code 0

Recommendation:

elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "::warning::$TEST_NAME returned PARTIAL verdict"
  # Optionally: exit 1 to fail the build on partial

Consider adding a workflow input to toggle strict mode (fail on PARTIAL).
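Such a toggle could be sketched as follows (STRICT is a hypothetical workflow input, default off):

```shell
# Sketch: verdict handling with an opt-in strict mode
evaluate_verdict() {
  local result="$1" strict="${2:-0}"
  if echo "$result" | grep -q "VERDICT: PASS"; then
    return 0
  elif echo "$result" | grep -q "VERDICT: PARTIAL"; then
    if [ "$strict" = "1" ]; then
      return 1    # strict mode: PARTIAL fails the job
    fi
    return 0      # default: PARTIAL passes (warning only)
  fi
  return 1        # FAIL or missing verdict
}
```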


5. Performance: Serial Test Execution Locally

Severity: LOW

eval/run.sh runs tests sequentially (eval/run.sh:76-78). With 5+ tests, this is slow.

Recommendation:
Add a --parallel flag using GNU parallel or background jobs.
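The background-jobs variant could be sketched like this (a minimal runner, not the actual eval/run.sh implementation):

```shell
# Sketch of a --parallel mode: run each command in the background,
# wait for all, and fail if any job failed
run_parallel() {
  local pids=() fail=0 cmd pid
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || fail=1
  done
  return "$fail"
}
```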


6. Code Quality: Hardcoded Plugin List

Severity: LOW

Plugin directories are hardcoded in two places:

  • CI workflow (.github/workflows/eval.yml:56)
  • Docs (docs/dev/evaluations.md:32)

If a new plugin is added (e.g., fitness/), developers must remember to update both.

Recommendation:
Generate plugin list dynamically from .claude-plugin/marketplace.json to ensure the plugin list stays in sync.
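Assuming jq is available and marketplace.json exposes a plugins array with a source path per entry (the field name is an assumption about the manifest shape), the flags could be derived like this:

```shell
# Sketch: build --plugin-dir flags from a marketplace manifest
# Assumes each entry in .plugins has a "source" path (hypothetical field name)
plugin_flags() {
  jq -r '.plugins[].source | "--plugin-dir ./\(.)"' "$1" | tr '\n' ' '
}
```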


7. Documentation: Missing Failure Examples

Severity: LOW

README shows successful output but not what failures look like. Developers won't know if FAIL/PARTIAL is expected for work-in-progress skills.

Add to eval/README.md:
Example output for both passing and failing tests.


8. Best Practice: Verdict Pattern Too Loose

Severity: LOW

Verdict detection uses grep -q "VERDICT: PASS" (.github/workflows/eval.yml:92), which matches anywhere in output. If Claude explains 'previous VERDICT: PASS was wrong', the test incorrectly passes.

Recommendation:
Match only at end of output:

if tail -n 5 "$EXECUTION_FILE" | grep -q "^VERDICT: PASS$"; then

Minor Nits

  1. Inconsistent flag naming: Workflow uses --plugin-dir but docs mention it is for 'adding plugins' (eval/run.sh does not use it). Clarify in docs.

  2. Typo in workflow: Line 100 says 'The output may not contain the expected VERDICT line' but should suggest checking the execution file path (.github/workflows/eval.yml:100-101).

  3. JSON schema unused locally: The schema is only used in full mode (without --simple), but docs do not explain when or why to use full mode (eval/README.md:20-22).


Test Coverage Assessment

Current tests cover:

  • ✅ 3 hope skills (gate, soul, trace)
  • ✅ 1 product skill
  • ✅ 1 wordsmith skill
  • ❌ 0 founder skills
  • ❌ 0 career skills

Recommendation: Add at least one test for founder and career to match the workflow's path triggers (.github/workflows/eval.yml:9-10).


Security Checklist

  • ✅ No secrets in code
  • ⚠️ show_full_output: true could leak secrets (see issue 1 above)
  • ✅ Workflow permissions minimal (contents: read, id-token: write)
  • ✅ No arbitrary code execution from YAML
  • ✅ Shell script uses safe variable expansion

Final Verdict

Recommendation: Approve with changes

This is solid work that adds real value. The architecture is sound and the documentation is thorough. Address the HIGH severity issue (#1) before merging, and consider the MEDIUM severity items for follow-up PRs.

Must-Fix Before Merge

  1. Remove or document show_full_output: true to prevent secret leaks

Should-Fix (Before or After Merge)

  1. Add Claude CLI installation check
  2. Clean up or document unused YAML fields
  3. Decide on PARTIAL verdict handling

Nice-to-Have

5-8. Performance, maintenance, and UX improvements

Great job on this foundation! The eval framework will make maintaining plugin quality much easier.


Confidence: 85% - Thorough review of all files, tested shell script logic mentally, but have not run the CI workflow end-to-end.
Key Assumption: The claude-code-action@v1 action's show_full_output does not redact secrets automatically.
Reversible: All issues have low-risk fixes that will not break existing functionality.
